[zephyr/tokenize] Use bulk list-objects for file sizes, delete filescan job#4658
Conversation
…-file stats

Replace per-file fsspec stat RPCs with fsspec `glob(detail=True)`, which returns file sizes from the same list-objects API call that discovers files — zero extra RPCs needed.

Zephyr changes:
- `Dataset.from_files()` now stores a lazy `GlobSource` instead of eagerly globbing. Resolution happens at plan time in `compute_plan()`.
- `resolve_glob()` uses `glob(detail=True)` to produce `FileEntry` objects (path + size) from the bulk listing response.
- `InputFileSpec` gains a `size` field, populated from the bulk listing.

Tokenize changes:
- Delete the tokenize-filescan Zephyr job (introduced in #4341) that launched 32 distributed workers just to stat files one at a time.
- Replace it with `_glob_with_sizes()`, which gets sizes for free from the same glob call. Works for `gs://`, `hf://`, `s3://`, and local paths.
- Remove the dead `InputName`/`ExecutorStep` early-return branch from `_get_filepaths_to_tokenize`.

Benchmark on nemotron hq_actual (2,755 files, 1 TB):
- `glob(detail=True)`: ~2s (same cost as a plain glob)
- Previous approach: 32-worker Zephyr job doing individual stats

Part of #4411, part of #4587
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 98604d2e02
@claude strip provenance.json from the validation/test sets as well, as per PR feedback
Claude finished @rjpower's task in 2m 43s. Done — provenance.json filter applied to validation files.
Change:

```python
train_file_stats = _glob_with_sizes(train_patterns)
train_file_stats = [f for f in train_file_stats if "provenance.json" not in f["filename"]]
validation_file_stats = _glob_with_sizes(validation_patterns)
validation_file_stats = [f for f in validation_file_stats if "provenance.json" not in f["filename"]]  # new
```

One-line mirror of the existing train filter.
The bulk glob flow previously stripped provenance.json only from train files, so directory-based validation inputs could include metadata JSON in the tokenization stream. Apply the same filter to validation files. Co-authored-by: Russell Power <rjpower@users.noreply.github.com>
```python
            output_paths.extend(fsspec_glob(os.path.join(path, f"**/*.{ex}")))
        else:
            output_paths.extend(fsspec_glob(path))
```

```python
_TOKENIZE_EXTENSIONS = ["json.{gz,zst,zstd}", "jsonl.{gz,zst,zstd}", "parquet", "json"]
```
is plain .json safe here?
we can probably remove it; probably less safe than useful, I guess. Too many chances for weird things.
```python
if not train_paths and not validation_paths:
    # Resolve patterns → concrete files with sizes (single list-objects call per pattern)
    train_file_stats = _glob_with_sizes(train_patterns)
    train_file_stats = [f for f in train_file_stats if "provenance.json" not in f["filename"]]
```
the "provenance.json" special-case seems brittle
yeah i'll figure out a more appropriate setup
the old code has this as well. I don't see a great way around it unfortunately.
we should really be writing our own metadata into isolated directories like dir/.marin/provenance.json, but of course we have a lot of leftover datasets with this in it.
I changed it to use a constant set of known metadata files & filter those out (and we don't look for plain .json anymore which should avoid most of the issues...)
```python
    path: str
    format: Literal["parquet", "jsonl", "vortex", "auto"] = "auto"
    size: int | None = None
```
docstring not updated. `size` here feels a bit off: `InputFileSpec` implies the "specification to read the data", so isn't `size` metadata rather than part of the specification?
moved off of the filespec
….json auto, named sidecar filter

- `InputFileSpec` is now a pure read-spec (`size` field removed)
- `FileEntry` holds spec + size; readers see `InputFileSpec` only, planners still size shards via `FileEntry.size`
- `_TOKENIZE_EXTENSIONS` no longer auto-globs plain `.json` (avoids matching sidecars)
- `_MARIN_SIDECAR_NAMES` + `_drop_sidecars` replaces the brittle substring filter; applied uniformly to train and validation
- Raise when a configured split (train or validation) resolves to zero files (codex P1)
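The spec/metadata split described in this commit could look roughly like the sketch below. The `spec`/`size` field names follow the commit message; the `plan_shards` helper and its greedy policy are illustrative assumptions, not the PR's actual planner:

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class InputFileSpec:
    """Pure read-spec: everything here is caller-supplied."""
    path: str
    format: Literal["parquet", "jsonl", "vortex", "auto"] = "auto"


@dataclass(frozen=True)
class FileEntry:
    """A discovered file: the read-spec plus metadata from the bulk listing."""
    spec: InputFileSpec
    size: int


def plan_shards(entries: list[FileEntry], max_shard_bytes: int) -> list[list[InputFileSpec]]:
    """Greedy sizing of shards by FileEntry.size; readers only ever
    receive the InputFileSpec, never the discovered metadata."""
    shards: list[list[InputFileSpec]] = []
    current: list[InputFileSpec] = []
    used = 0
    for entry in sorted(entries, key=lambda e: e.size, reverse=True):
        if current and used + entry.size > max_shard_bytes:
            shards.append(current)
            current, used = [], 0
        current.append(entry.spec)
        used += entry.size
    if current:
        shards.append(current)
    return shards
```

Keeping `size` on `FileEntry` rather than `InputFileSpec` answers the review comment above: discovered metadata stays out of the caller-supplied read spec.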
```python
def _glob_with_sizes(patterns: list[str]) -> list[dict]:
    """Glob patterns and return [{"filename": path, "size": bytes}].
```

```python
# NOTE(chris): Marin's `default_download` writes a `provenance.json` sidecar next to
```

```python
class InputFileSpec:
    """Specification for reading a file or portion of a file.

    Pure read-spec: everything here is caller-supplied. Discovered metadata
```
I dislike these kinds of side-effect comments from Claude :/

What

Replace per-file `fsspec_size()` stat RPCs with `fsspec.glob(detail=True)`, which returns file sizes from the same list-objects API call that discovers files. Zero extra RPCs.

This eliminates the `tokenize-filescan` Zephyr job (#4341) that launched 32 distributed workers just to stat files one at a time. On nemotron hq_actual (2,755 files, 1 TB), `glob(detail=True)` takes ~2s — the same cost as a plain glob.

Zephyr changes

- `Dataset.from_files()` now stores a lazy `GlobSource` op instead of eagerly globbing at construction time
- `compute_plan()` resolves `GlobSource` → `FileEntry` objects (path + size) via `resolve_glob()`
- `InputFileSpec` gains a `size` field populated from the bulk listing
- `_compute_file_pushdown` now takes `list[FileEntry]` instead of `list[str]`

Tokenize changes

- Delete the `tokenize-filescan` Zephyr job and all its machinery (`fsspec_size`, batched stat workers)
- Add `_glob_with_sizes()` — takes a list of patterns, returns `[{"filename", "size"}]` using `detail=True`
- Add `_expand_tokenize_paths()` — directory → recursive extension globs
- `TokenizeConfig` and `HfTokenizeConfig` paths now go through the same glob → bundle flow
- Remove the dead `InputName`/`ExecutorStep` early-return branch

Benchmark

`detail=True` works identically for `gs://`, `hf://`, `s3://`, and local filesystems.

Part of #4411, part of #4587
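The lazy-glob flow described above (patterns recorded at construction, resolved only when a plan is computed) can be sketched like this. These are illustrative stand-ins for Zephyr's actual `Dataset`/`GlobSource` classes, assuming only fsspec's `url_to_fs` and `glob(detail=True)`:

```python
from dataclasses import dataclass

import fsspec


@dataclass
class GlobSource:
    """Unresolved glob patterns; no I/O until resolve() is called."""
    patterns: list[str]

    def resolve(self) -> list[tuple[str, int]]:
        """One list-objects call per pattern yields paths *and* sizes."""
        entries = []
        for pattern in self.patterns:
            fs, path = fsspec.core.url_to_fs(pattern)
            for name, info in fs.glob(path, detail=True).items():
                if info.get("type") == "file":
                    entries.append((name, info["size"]))
        return sorted(entries)


class Dataset:
    def __init__(self, source: GlobSource):
        self._source = source  # no globbing at construction time

    @classmethod
    def from_files(cls, *patterns: str) -> "Dataset":
        return cls(GlobSource(list(patterns)))

    def compute_plan(self) -> list[tuple[str, int]]:
        return self._source.resolve()  # globbing happens here, at plan time
```

Deferring resolution to `compute_plan()` means a dataset definition stays cheap to construct, and the listing reflects the files that exist when the plan actually runs.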